In this assignment, you will apply some popular machine learning techniques to the problem of classifying data from histological cell images for the diagnosis of malignant breast cancer. This will be presented as a practical scenario where you are approached by a client to solve a problem.
The main aims of this assignment are:
This assignment relates to the following ACS CBOK areas: abstraction, design, hardware and software, data and information, HCI and programming.
This assignment is divided into several tasks. Use the spaces provided in this notebook to answer the questions posed in each task. Note that some questions require writing a small amount of code, some require graphical results, and some require comments or analysis as text. It is your responsibility to make sure your responses are clearly labelled and your code has been fully executed (with the correct results displayed) before submission!
Do not manually edit the data set file we have provided! For marking purposes, it's important that your code runs correctly on the original data file.
Some parts of this assignment build on the workflow from the first assignment and that part of the course, so less detailed instructions are provided for them: you should be able to implement this workflow now without low-level guidance. A substantial portion of the marks for this assignment is associated with making the right choices and executing this workflow correctly and efficiently. Make sure your code is clean and readable as well as producing the required outputs, since your coding will also count towards the marks. However, excessive commenting is discouraged and will lose marks, so aim for a modest, well-chosen amount of comments and text in outputs.
This assignment can be solved using methods from sklearn, pandas, and matplotlib as presented in the workshops. Other libraries should not be used (even though they might have nice functionality) and certain restrictions on sklearn functions will be made clear in the instruction text. You are expected to search and carefully read the documentation for functions that you use, to ensure you are using them correctly.
A client approaches you to solve a machine learning problem for them. They run a pathology lab that processes histological images for healthcare providers and they have created a product that measures the same features as in the Wisconsin breast cancer data set though using different acquisitions and processing methods. This makes their method much faster than existing ones, but it is also slightly noisier. They want to be able to diagnose malignant cancer (and distinguish them from benign growths) by employing machine learning techniques, and they have asked you to implement this for them.
Their requirements are: 1) have at least a 95% probability of detecting malignant cancer when it is present; 2) have no more than 1 in 10 healthy cases (those with benign tumours) labelled as positive (malignant).
They have hand-labelled 300 samples for you, which is all they have at the moment.
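The client's two requirements can be checked directly from a confusion matrix. A minimal sketch of such a check (the helper name and the toy labels here are my own illustration, not part of the assignment):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def meets_client_criteria(y_true, y_pred):
    # Labels assumed encoded as 1 = malignant (positive), 0 = benign (negative)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # probability of detecting malignant cancer when present
    fpr = fp / (fp + tn)          # fraction of benign cases labelled malignant
    return sensitivity >= 0.95 and fpr <= 0.10

# Hypothetical example: 19/20 malignant detected (95%), 1/20 benign flagged (5%)
y_true = np.array([1] * 20 + [0] * 20)
y_pred = np.array([1] * 19 + [0] + [1] + [0] * 19)
```

Requirement 1 corresponds to the sensitivity (recall) bound and requirement 2 to the false-positive-rate bound.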
Please follow the instructions below, which will vary in level of detail, as appropriate to the marks given.
# This code imports some libraries that you will need.
# You should not need to modify it, though you are expected to make other imports later in your code.
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Common imports
import numpy as np
import time
# Pandas for overview
import pandas as pd
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
from sklearn import tree
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
# Plot setup
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=7)
mpl.rc('xtick', labelsize=6)
mpl.rc('ytick', labelsize=6)
mpl.rc('figure', dpi=240)
plt.close('all')
import seaborn as sns
Do this from the csv file, assignment2.csv, as done in assignment 1 and workshops 2 and 3. Extract the feature names and label names for use later on. Note that we will be treating the malignant case as our positive case, as this is the standard convention in medicine.
Print out some information (in text) about the data, to verify that the loading has worked and to get a feeling for what is present in the dataset and the range of the values.
Also, graphically show the proportions of the labels in the whole dataset.
# Your code here
# Read data from CSV
dataset = pd.read_csv('assignment2.csv')
# Some information (in text) about the data
# 1. Display dataset
display(dataset)
# 2. Display data range
display(dataset.describe().T)
# 3. Display data types (info() prints directly and returns None, so no display() needed)
dataset.info()
# 4. Display number of na records
print("Check number of na records:")
display(dataset.isna().sum())
| | label | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | malignant | 15.494654 | 15.902542 | 103.008265 | 776.437239 | 0.104239 | 0.168660 | 0.170572 | 0.085668 | 0.205053 | ... | 19.522957 | 22.427276 | 135.128520 | 1286.903131 | 0.142725 | 0.407483 | 0.445992 | 0.171662 | 0.353211 | 0.097731 |
| 1 | malignant | 16.229871 | 18.785613 | 105.176755 | 874.712003 | 0.091843 | 0.092548 | 0.081681 | 0.053670 | 0.180435 | ... | 19.140235 | 24.905156 | 123.886045 | 1234.499997 | 0.129135 | 0.223918 | 0.248846 | 0.136735 | 0.284427 | 0.085758 |
| 2 | malignant | 16.345671 | 20.114076 | 107.083804 | 872.563251 | 0.099924 | 0.123799 | 0.128788 | 0.078310 | 0.189756 | ... | 19.144816 | 25.601433 | 125.113036 | 1202.749973 | 0.135017 | 0.314402 | 0.332505 | 0.161497 | 0.313038 | 0.084340 |
| 3 | malignant | 13.001009 | 19.876997 | 85.889775 | 541.281012 | 0.113423 | 0.173069 | 0.146214 | 0.069574 | 0.212078 | ... | 15.565911 | 26.145119 | 102.958265 | 737.655082 | 0.161390 | 0.485912 | 0.430007 | 0.167254 | 0.432297 | 0.117705 |
| 4 | malignant | 16.416060 | 17.397533 | 107.857386 | 891.516818 | 0.097321 | 0.111530 | 0.125971 | 0.068575 | 0.179562 | ... | 18.620376 | 22.306233 | 124.002529 | 1139.490971 | 0.133950 | 0.230996 | 0.316620 | 0.131715 | 0.269591 | 0.080497 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | benign | 14.048464 | 17.186671 | 90.974271 | 637.474225 | 0.094969 | 0.091549 | 0.063532 | 0.039494 | 0.173324 | ... | 15.790651 | 22.538529 | 103.423320 | 819.408970 | 0.126466 | 0.206701 | 0.192139 | 0.095350 | 0.287380 | 0.078520 |
| 296 | benign | 12.879033 | 16.767790 | 83.123369 | 539.225356 | 0.092146 | 0.083986 | 0.059347 | 0.035404 | 0.167690 | ... | 14.358919 | 21.955513 | 93.620160 | 684.694077 | 0.118165 | 0.191978 | 0.180949 | 0.083989 | 0.263879 | 0.078279 |
| 297 | malignant | 13.123052 | 18.793057 | 84.897717 | 555.002209 | 0.098036 | 0.090178 | 0.066586 | 0.043711 | 0.172389 | ... | 14.991646 | 24.820718 | 97.933068 | 726.695117 | 0.126203 | 0.201766 | 0.202433 | 0.100361 | 0.256863 | 0.079667 |
| 298 | benign | 14.411991 | 18.970674 | 93.423809 | 671.128126 | 0.086304 | 0.090118 | 0.070882 | 0.039482 | 0.175789 | ... | 16.555187 | 25.591332 | 108.978466 | 893.818250 | 0.120338 | 0.246945 | 0.236415 | 0.105354 | 0.280900 | 0.081828 |
| 299 | benign | 12.704174 | 20.895143 | 82.227859 | 528.052132 | 0.098300 | 0.093698 | 0.068184 | 0.038141 | 0.178533 | ... | 14.199113 | 25.377961 | 93.143286 | 681.453918 | 0.125313 | 0.195607 | 0.192059 | 0.085053 | 0.265963 | 0.078269 |
300 rows × 31 columns
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| mean radius | 300.0 | 14.231808 | 1.297393 | 11.560025 | 13.356676 | 13.976933 | 15.103078 | 19.090091 |
| mean texture | 300.0 | 19.312619 | 1.572224 | 15.349270 | 18.194791 | 19.220652 | 20.245660 | 26.836291 |
| mean perimeter | 300.0 | 92.727687 | 8.949937 | 74.690886 | 86.659535 | 90.896982 | 99.093762 | 126.168030 |
| mean area | 300.0 | 664.367372 | 129.515717 | 477.371592 | 580.383274 | 628.004851 | 737.444716 | 1300.788708 |
| mean smoothness | 300.0 | 0.096937 | 0.005067 | 0.084651 | 0.093305 | 0.096722 | 0.099995 | 0.114500 |
| mean compactness | 300.0 | 0.106615 | 0.020819 | 0.075184 | 0.091105 | 0.102401 | 0.117334 | 0.192880 |
| mean concavity | 300.0 | 0.092591 | 0.030312 | 0.050771 | 0.069071 | 0.084829 | 0.107994 | 0.212704 |
| mean concave points | 300.0 | 0.050820 | 0.014350 | 0.028701 | 0.039507 | 0.046744 | 0.060606 | 0.105212 |
| mean symmetry | 300.0 | 0.182546 | 0.010754 | 0.157059 | 0.175353 | 0.181685 | 0.187789 | 0.226448 |
| mean fractal dimension | 300.0 | 0.062841 | 0.002736 | 0.057830 | 0.060950 | 0.062477 | 0.064149 | 0.076091 |
| radius error | 300.0 | 0.416393 | 0.104913 | 0.298005 | 0.347373 | 0.382932 | 0.453831 | 1.287142 |
| texture error | 300.0 | 1.216924 | 0.200404 | 0.898026 | 1.078944 | 1.183021 | 1.299564 | 2.561348 |
| perimeter error | 300.0 | 2.938814 | 0.784066 | 2.059186 | 2.452051 | 2.687216 | 3.160189 | 9.707670 |
| area error | 300.0 | 41.668095 | 16.512927 | 27.693748 | 32.438720 | 35.810512 | 44.720929 | 214.346096 |
| smoothness error | 300.0 | 0.007060 | 0.001169 | 0.004994 | 0.006348 | 0.006770 | 0.007554 | 0.015803 |
| compactness error | 300.0 | 0.026110 | 0.007470 | 0.016907 | 0.020859 | 0.024072 | 0.028725 | 0.064581 |
| concavity error | 300.0 | 0.032803 | 0.013074 | 0.018730 | 0.026071 | 0.030034 | 0.035882 | 0.163592 |
| concave points error | 300.0 | 0.012046 | 0.002414 | 0.007253 | 0.010509 | 0.011673 | 0.013134 | 0.026554 |
| symmetry error | 300.0 | 0.020750 | 0.003406 | 0.016181 | 0.018503 | 0.019825 | 0.021823 | 0.041861 |
| fractal dimension error | 300.0 | 0.003860 | 0.001146 | 0.002631 | 0.003254 | 0.003575 | 0.004072 | 0.013251 |
| worst radius | 300.0 | 16.460566 | 1.798202 | 13.279265 | 15.148044 | 16.007171 | 17.656889 | 22.676185 |
| worst texture | 300.0 | 25.772128 | 2.346310 | 20.144214 | 24.058893 | 25.689861 | 27.333610 | 34.614459 |
| worst perimeter | 300.0 | 108.563914 | 12.500033 | 87.110184 | 99.229249 | 105.540619 | 116.274995 | 150.353232 |
| worst area | 300.0 | 900.644633 | 209.738842 | 633.771881 | 752.124790 | 828.667704 | 1011.628413 | 1796.820974 |
| worst smoothness | 300.0 | 0.133424 | 0.008678 | 0.110342 | 0.127682 | 0.133064 | 0.138650 | 0.164583 |
| worst compactness | 300.0 | 0.261732 | 0.063535 | 0.167098 | 0.215767 | 0.247022 | 0.298732 | 0.543118 |
| worst concavity | 300.0 | 0.282075 | 0.079831 | 0.152272 | 0.219671 | 0.267894 | 0.325278 | 0.635074 |
| worst concave points | 300.0 | 0.118146 | 0.024552 | 0.066927 | 0.098389 | 0.115679 | 0.136687 | 0.179794 |
| worst symmetry | 300.0 | 0.293620 | 0.025620 | 0.240341 | 0.277676 | 0.288994 | 0.305227 | 0.432297 |
| worst fractal dimension | 300.0 | 0.084556 | 0.007427 | 0.072745 | 0.079636 | 0.082610 | 0.087645 | 0.128288 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   label                    300 non-null    object
 1   mean radius              300 non-null    float64
 2   mean texture             300 non-null    float64
 3   mean perimeter           300 non-null    float64
 4   mean area                300 non-null    float64
 5   mean smoothness          300 non-null    float64
 6   mean compactness         300 non-null    float64
 7   mean concavity           300 non-null    float64
 8   mean concave points      300 non-null    float64
 9   mean symmetry            300 non-null    float64
 10  mean fractal dimension   300 non-null    float64
 11  radius error             300 non-null    float64
 12  texture error            300 non-null    float64
 13  perimeter error          300 non-null    float64
 14  area error               300 non-null    float64
 15  smoothness error         300 non-null    float64
 16  compactness error        300 non-null    float64
 17  concavity error          300 non-null    float64
 18  concave points error     300 non-null    float64
 19  symmetry error           300 non-null    float64
 20  fractal dimension error  300 non-null    float64
 21  worst radius             300 non-null    float64
 22  worst texture            300 non-null    float64
 23  worst perimeter          300 non-null    float64
 24  worst area               300 non-null    float64
 25  worst smoothness         300 non-null    float64
 26  worst compactness        300 non-null    float64
 27  worst concavity          300 non-null    float64
 28  worst concave points     300 non-null    float64
 29  worst symmetry           300 non-null    float64
 30  worst fractal dimension  300 non-null    float64
dtypes: float64(30), object(1)
memory usage: 72.8+ KB
Check number of na records:
label                      0
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64
# Extract feature names
features = dataset.columns[1:].tolist()
print(f"List of feature names ({len(features)} items):", features)
print()
# Extract labels
label_value_counts = dataset['label'].value_counts()
labels = label_value_counts.index.tolist()
print(f"List of label names ({len(labels)} items):", labels)
# Print number of each labels
print(label_value_counts)
# Graphically show the proportions of the labels in the whole dataset.
plt.figure(dpi=300)
sns.set()
ax = sns.countplot(x=dataset['label'])
# Calculate percentage of each label and add percentage label in the figure
total = len(dataset)
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total)
    x = p.get_x() + p.get_width() / 2
    y = p.get_height() // 2
    ax.annotate(percentage, (x, y), ha='center')
plt.show()
List of feature names (30 items): ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']

List of label names (2 items): ['benign', 'malignant']
benign       154
malignant    146
Name: label, dtype: int64
As this data is well curated by the client already, you do not need to worry about outliers, missing values or imputation in this case, but be aware that this is the exception, not the rule.
To familiarise yourself with the nature and information contained in the data, display histograms for the data according to the following instructions:

# Code fragment to help with plotting histograms combining matplotlib and seaborn (and pandas)
def plot_histograms(features, Nrows, Ncols):
    fig, axes = plt.subplots(Nrows, Ncols, figsize=(20, 10))
    for row in range(Nrows):
        for col in range(Ncols):
            feature = features[row*Ncols + col]
            sns.histplot(data=dataset, x=feature, hue="label", bins='auto', kde=True, ax=axes[row, col], edgecolor=None)
    plt.show()
# Your code here
plot_histograms(features[:10], 2, 5)
plot_histograms(features[10:20], 2, 5)
plot_histograms(features[20:30], 2, 5)
Based on the histograms, which do you think are the 3 strongest features for discriminating between the classes?
# Your answer here
# Convert label into 1 and 0.
# Treating the malignant case as the positive case, as this is the standard convention in medicine.
dataset['label'].replace(
    ['malignant', 'benign'],
    [1, 0],
    inplace=True
)
# Show correlation between features and label
correlation = dataset.corr()
# Show correlation between top features (abs of correlation is greater than 0.6)
# Create dataframe of correlation matrix
df_correlation = pd.DataFrame(
    correlation.unstack().sort_values(ascending=False)['label'],
    columns=['Correlation to the target']
)
# Create column 'Abs of correlation'
df_correlation['Abs of correlation'] = abs(df_correlation['Correlation to the target'])
# Sort by 'Abs of correlation'
df_correlation = df_correlation.sort_values(by='Abs of correlation', ascending=False, axis=0)
# Filter features with abs of correlation greater than 0.6
df_correlation = df_correlation.loc[df_correlation['Abs of correlation'] > 0.6]
display(df_correlation)
# Get top features
top_corr_features = correlation.index[abs(correlation["label"])>0.6]
# Show heatmap
plt.figure(figsize=(10,10))
sns.heatmap(dataset[top_corr_features].corr(), annot=True, cmap="RdYlGn")
| Correlation to the target | Abs of correlation | |
|---|---|---|
| label | 1.000000 | 1.000000 |
| worst concave points | 0.778575 | 0.778575 |
| worst perimeter | 0.764033 | 0.764033 |
| worst radius | 0.755703 | 0.755703 |
| mean concave points | 0.729886 | 0.729886 |
| mean perimeter | 0.710967 | 0.710967 |
| worst area | 0.705986 | 0.705986 |
| mean radius | 0.698323 | 0.698323 |
| mean area | 0.664719 | 0.664719 |
| mean concavity | 0.623550 | 0.623550 |
| worst concavity | 0.622537 | 0.622537 |
Based on the correlation, the 3 strongest features for discriminating between the classes are: worst concave points, worst perimeter and worst radius.
This is because these 3 features have the highest absolute correlation with the label.
Split the dataset into appropriate subsets. You must choose what the subsets are and how big they are. However, we want to make sure the proportion of the two classes is consistent across all datasets, so use the stratify option, as used in workshops 5 and 6. Verify the size and label distribution in each dataset.
# Your code here
from sklearn.model_selection import train_test_split
# Split dataset into training set, validation set and test set
bigtrain_set, test_set = train_test_split(dataset, test_size=0.2, random_state=42, stratify=dataset['label'])
train_set, val_set = train_test_split(bigtrain_set, test_size=0.2, random_state=42, stratify=bigtrain_set['label'])
X_train = train_set.drop("label",axis=1)
y_train = train_set["label"].copy()
X_test = test_set.drop("label",axis=1)
y_test = test_set["label"].copy()
X_val = val_set.drop("label",axis=1)
y_val = val_set["label"].copy()
print(f'Shape of the whole dataset: {dataset.shape}')
print()
print("Verify the size in each dataset")
print(f'\t1. Shape of the training dataset: {train_set.shape}')
print(f'\t2. Shape of the validation dataset: {val_set.shape}')
print(f'\t3. Shape of the test dataset: {test_set.shape}')
print()
print("Verify label distribution in each dataset")
print(f'\t1. Mean of label in training dataset: {np.mean(y_train)}')
print(f'\t2. Mean of label in validation dataset: {np.mean(y_val)}')
print(f'\t3. Mean of label in test dataset: {np.mean(y_test)}')
Shape of the whole dataset: (300, 31)

Verify the size in each dataset
	1. Shape of the training dataset: (192, 31)
	2. Shape of the validation dataset: (48, 31)
	3. Shape of the test dataset: (60, 31)

Verify label distribution in each dataset
	1. Mean of label in training dataset: 0.4895833333333333
	2. Mean of label in validation dataset: 0.4791666666666667
	3. Mean of label in test dataset: 0.48333333333333334
Build a pre-processing pipeline that includes imputation (as even though we don't strictly need it here it is a good habit to always include it) and other appropriate pre-processing.
# Your code here
from sklearn.impute import SimpleImputer
preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('scaler', StandardScaler())
])
For our classification task we will consider three simple baseline cases: 1) predicting all samples to be negative (class 1) 2) predicting all samples to be positive (class 2) 3) making a random prediction for each sample with equal probability for each class
For each case measure and display the following metrics:
Code is given below for the latter metrics (all metrics are discussed in lecture 4 and many are in workshop 4).
Also calculate and display the confusion matrix for each baseline case, using a heatmap and numbers (as in workshop 4).
from sklearn.metrics import fbeta_score, make_scorer
f10_scorer = make_scorer(fbeta_score, beta=10)
f01_scorer = make_scorer(fbeta_score, beta=0.1)
def f10_score(yt, yp):
    return fbeta_score(yt, yp, beta=10)

def f01_score(yt, yp):
    return fbeta_score(yt, yp, beta=0.1)
# Your code here
from numpy.random import default_rng
# Generate a random prediction with equal probability for each class
# Default random generator
rng = default_rng()
# Randomly pick half of the validation-set indexes to be labelled positive
indexes = rng.choice(y_val.shape[0], size=int(y_val.shape[0] / 2), replace=False)
# Create the random prediction vector (all zeros, then set the chosen half to 1)
random_y_test = np.zeros(y_val.shape[0])
random_y_test[indexes] = 1
print(random_y_test)
[0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 0.]
# 3 baseline cases
# 1. all samples to be negative (class 1)
# 2. all samples to be positive (class 2)
# 3. random prediction for each sample with equal probability for each class
from sklearn.metrics import confusion_matrix, accuracy_score, balanced_accuracy_score, recall_score, precision_score, \
    roc_auc_score, f1_score, roc_curve, precision_recall_curve, hinge_loss, auc
baseline_cases = [{
    'name': '1. Predicting all samples to be negative (class 1)',
    'y_predict': np.zeros(y_val.shape[0])
}, {
    'name': '2. Predicting all samples to be positive (class 2)',
    'y_predict': np.ones(y_val.shape[0])
}, {
    'name': '3. Random prediction for each sample with equal probability for each class',
    'y_predict': random_y_test
}]
def report_baseline_case(y_val, case):
    plt.rcParams['figure.dpi'] = 240
    plt.rcParams['savefig.dpi'] = 240
    y_predict = case['y_predict']
    name = case['name']
    # Print measures
    print(f'Case {name}\n')
    # Balanced accuracy
    print(f'Balanced accuracy: {balanced_accuracy_score(y_val, y_predict)}')
    # Recall and precision
    print(f'Recall: {recall_score(y_val, y_predict, zero_division=1)}')
    print(f'Precision: {precision_score(y_val, y_predict, zero_division=1)}')
    # AUC
    fpr, tpr, thresholds = roc_curve(y_val, y_predict, pos_label=1)
    print(f'AUC: {auc(fpr, tpr)}')
    # F1 score
    print(f'F1 score: {f1_score(y_val, y_predict)}')
    # fbeta_score with beta=0.1
    print(f'Fbeta score with beta = 0.1: {f01_score(y_val, y_predict)}')
    # fbeta_score with beta=10
    print(f'Fbeta score with beta = 10: {f10_score(y_val, y_predict)}')
    # Calculate and display confusion matrix
    cmat = confusion_matrix(y_true=y_val, y_pred=y_predict)
    sns.heatmap(cmat, annot=True)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title('Confusion Matrix')
    plt.show()
    print()

for case in baseline_cases:
    report_baseline_case(y_val, case)
Case 1. Predicting all samples to be negative (class 1)

Balanced accuracy: 0.5
Recall: 0.0
Precision: 1.0
AUC: 0.5
F1 score: 0.0
Fbeta score with beta = 0.1: 0.0
Fbeta score with beta = 10: 0.0

Case 2. Predicting all samples to be positive (class 2)

Balanced accuracy: 0.5
Recall: 1.0
Precision: 0.4791666666666667
AUC: 0.5
F1 score: 0.647887323943662
Fbeta score with beta = 0.1: 0.48165042504665145
Fbeta score with beta = 10: 0.9893526405451447

Case 3. Random prediction for each sample with equal probability for each class

Balanced accuracy: 0.5208695652173914
Recall: 0.5217391304347826
Precision: 0.5
AUC: 0.5208695652173914
F1 score: 0.5106382978723404
Fbeta score with beta = 0.1: 0.5002063557573256
Fbeta score with beta = 10: 0.5215146299483648
Based on the above baseline tests and the client's requirements, choose a performance metric to use for evaluating/driving your machine learning methods. Give a reason for your choice.
Based on the requirements:
1) detect malignant cancer with at least 95% probability when it is present => recall (TPR) >= 95%
2) no more than 1 in 10 benign cases labelled as malignant => FPR <= 10%
=> Both recall (TPR) and FPR are important in this scenario.
Because AUC reflects both TPR and FPR, AUC is chosen as the performance metric for evaluation.
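One caveat worth noting: when the AUC is computed from hard 0/1 predictions rather than from decision scores, the ROC "curve" has a single operating point and the score reduces to balanced accuracy, (TPR + TNR)/2, which still combines the two quantities of interest. A quick numerical check of this identity (the random labels here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, balanced_accuracy_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)  # hard 0/1 predictions, not scores

# ROC from binary predictions has points (0,0), (FPR,TPR), (1,1);
# the trapezoidal area works out to (TPR + TNR)/2 = balanced accuracy
fpr, tpr, _ = roc_curve(y_true, y_pred, pos_label=1)
assert np.isclose(auc(fpr, tpr), balanced_accuracy_score(y_true, y_pred))
```

This is why the "AUC" values reported for the classifiers in this notebook coincide with balanced accuracy; for a full ROC curve one would pass continuous scores (e.g. from `decision_function`) to `roc_curve` instead.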
For a stronger baseline, train and evaluate the Stochastic Gradient Descent classifier (as seen in workshop 5). For this baseline case use the default settings for all the hyperparameters.
# Your code here
from sklearn.linear_model import SGDClassifier
sgd_pipeline = Pipeline([
    ('preprocess', preprocessing_pipeline),
    ('sgd', SGDClassifier())
])
# Train
sgd_pipeline.fit(X_train,y_train)
# Predict
y_val_pred = sgd_pipeline.predict(X_val)
# Display Metric AUC
fpr, tpr, thresholds = roc_curve(y_val, y_val_pred, pos_label=1)
print(f'AUC: {auc(fpr, tpr)}')
# ROC curve
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.plot(fpr,tpr,'r')
plt.title('ROC curve')
plt.show()
# Error binary
plt.plot(np.abs(y_val - y_val_pred),'r.')
plt.xlabel('Sample Number')
plt.ylabel('Error Status')
plt.title('Errors - Binary')
plt.show()
AUC: 0.9565217391304348
Calculate and display the normalized version of the confusion matrix. From this calculate the probability that a sample from a person with a malignant tumour is given a result that they do not have cancer. Which of the client's two criteria does this relate to, and is this baseline satisfying this criterion or not?
# Your code here
cmat = confusion_matrix(y_true=y_val, y_pred=y_val_pred, normalize='true')
sns.heatmap(cmat, annot=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Normalised Confusion Matrix')
plt.show()
The probability that a sample from a person with a malignant tumour is given a result that they do not have cancer is 8.7%.
This relates to criterion 1: have at least a 95% probability of detecting malignant cancer when it is present.
This baseline does not satisfy the criterion, because the probability of detecting malignant cancer when it is present is only 91.3% (1 − 8.7%), which is below the required 95%.
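With `normalize='true'` each row of the confusion matrix sums to 1, so this probability can be read directly from the cell at row "true = malignant", column "predicted = benign". A small self-contained illustration (the toy labels are my own, not the assignment data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = malignant, 0 = benign (not the assignment data)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1])

# Row-normalised matrix: row 0 is the true-benign class, row 1 true-malignant
cmat = confusion_matrix(y_true, y_pred, normalize='true')

# P(predicted benign | truly malignant) = false negative rate = 1 - recall
p_missed = cmat[1, 0]
```

Here one of the four malignant samples is missed, so `p_missed` is 0.25.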
Train and optimise the hyperparameters to give the best performance for each of the following classifiers:
Follow best practice as much as possible here. You must make all the choices and decisions yourself, and strike a balance between computation time and performance.
You can use any of the sci-kit learn functions in sklearn.model_selection.cross* and anything used in workshops 3, 4, 5 and 6. Other hyper-parameter optimisation functions apart from these cannot be used (even if they are good and can be part of best practice in other situations - for this assignment everyone should assume they only have very limited computation resources and limit themselves to these functions).
Display the performance of the different classifiers and the optimised hyperparameters.
Based on these results, list the best 3 classifiers and indicate if you think any perform equivalently.
# Your code here
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# 1. KNN (K-Nearest Neighbour) classifier
knn_pl = Pipeline([
    ('preprocess', preprocessing_pipeline),
    ('knn', KNeighborsClassifier())
])
knn_params = dict(
    knn__n_neighbors=[3, 5, 10, 20, 30, 50],
    knn__weights=['uniform', 'distance']
)
# 2. Decision tree classifier
dt_pl = Pipeline([
    ('preprocess', preprocessing_pipeline),
    ('dt', DecisionTreeClassifier())
])
dt_params = dict(
    dt__max_depth=range(1, 20),
    dt__min_samples_leaf=range(1, 20)
)
# 3. Support vector machine classifier
svc_pl = Pipeline([
    ('preprocess', preprocessing_pipeline),
    ('svc', SVC())
])
svc_params = dict(
    svc__kernel=['linear', 'rbf', 'poly']
)
# 4. SGD classifier
sgd_pl = Pipeline([
    ('preprocess', preprocessing_pipeline),
    ('sgd', SGDClassifier())
])
sgd_params = dict(
    sgd__alpha=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
    sgd__penalty=['l2', 'l1', 'elasticnet'],
    sgd__learning_rate=['constant', 'optimal'],
    sgd__eta0=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
)
# List models
dict_models = dict()
dict_models["KNN (K-Nearest Neighbour) classifier"] = (knn_pl, knn_params)
dict_models["Decision tree classifier"] = (dt_pl, dt_params)
dict_models["Support vector machine classifier"] = (svc_pl, svc_params)
dict_models["SGD classifier"] = (sgd_pl, sgd_params)
# Scorer
scorer = 'roc_auc'
# List score of models
dict_scores = dict()
def report_model(pl, params, model_name, y_true):
    cv = GridSearchCV(
        pl,
        params,
        cv=5,
        n_jobs=-1,
        scoring=scorer,
        return_train_score=True)
    cv.fit(X_train, y_train)
    print(model_name)
    print(f'Best params for {model_name} : {cv.best_params_}')
    # GridSearchCV refits the best estimator on the whole training set by
    # default, so it can be used directly to predict on the validation set
    y_pred = cv.best_estimator_.predict(X_val)
    # AUC
    fpr, tpr, thresholds = roc_curve(y_true, y_pred, pos_label=1)
    auc_score = auc(fpr, tpr)
    print(f'AUC: {auc_score}')
    # ROC curve
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.plot(fpr, tpr, 'r')
    plt.title('ROC curve')
    plt.show()
    # Calculate and display confusion matrix
    cmat = confusion_matrix(y_true=y_true, y_pred=y_pred)
    sns.heatmap(cmat, annot=True)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title('Confusion Matrix')
    plt.show()
    # Add score to dictionary
    dict_scores[model_name] = (auc_score, cv.best_estimator_, cv.best_params_)
    print()
    return auc_score
for key, (pl, params) in dict_models.items():
    report_model(pl, params, key, y_val)
KNN (K-Nearest Neighbour) classifier
Best params for KNN (K-Nearest Neighbour) classifier : {'knn__n_neighbors': 20, 'knn__weights': 'distance'}
AUC: 0.9365217391304348
Decision tree classifier
Best params for Decision tree classifier : {'dt__max_depth': 5, 'dt__min_samples_leaf': 7}
AUC: 0.8930434782608696
Support vector machine classifier
Best params for Support vector machine classifier : {'svc__kernel': 'rbf'}
AUC: 0.9365217391304348
SGD classifier
Best params for SGD classifier : {'sgd__alpha': 0.1, 'sgd__eta0': 0.0001, 'sgd__learning_rate': 'optimal', 'sgd__penalty': 'l2'}
AUC: 0.9565217391304348
for key, value in dict_scores.items():
    print(f'{key}')
    print(f'\tAUC = {value[0]}')
KNN (K-Nearest Neighbour) classifier
	AUC = 0.9365217391304348
Decision tree classifier
	AUC = 0.8930434782608696
Support vector machine classifier
	AUC = 0.9365217391304348
SGD classifier
	AUC = 0.9565217391304348
# Your answer here
print(f"""
List of the best 3 classifiers based on evaluation on the validation set:
1. SGD classifier, with the highest AUC of nearly {100*round(dict_scores['SGD classifier'][0], 2)}%
2. KNN classifier and Support vector machine classifier, which perform equivalently with an AUC of around {100*round(dict_scores['KNN (K-Nearest Neighbour) classifier'][0], 2)}%
All three classifiers perform well on this dataset, but the validation set is small; with a larger validation set, the performance estimates would be more accurate.
""")
List of the best 3 classifiers based on evaluation on the validation set:
1. SGD classifier, with the highest AUC of nearly 96.0%
2. KNN classifier and Support vector machine classifier, which perform equivalently with an AUC of around 94.0%
All three classifiers perform well on this dataset, but the validation set is small; with a larger validation set, the performance estimates would be more accurate.
Choose the best classifier (as seen in workshops 3 to 6) and give details of your hyperparameter settings. Explain the reason for your choice.
# Your answer here
# index 0: auc score
# index 1: cv.best_estimator_
# index 2: cv.best_params_
sgd = dict_scores['SGD classifier']
print(f'hyperparameter settings : {sgd[2]}')
model = sgd[1]
print(model)
hyperparameter settings : {'sgd__alpha': 0.1, 'sgd__eta0': 0.0001, 'sgd__learning_rate': 'optimal', 'sgd__penalty': 'l2'}
Pipeline(steps=[('preprocess',
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])),
('sgd', SGDClassifier(alpha=0.1, eta0=0.0001))])
Your answer here
The SGD classifier has been selected as the best model at this stage because, in the classifier comparison above, it has the highest AUC score on the validation set.
Calculate and display an unbiased performance measure that you can present to the client.
Is your chosen classifier underfitting or overfitting?
Does your chosen classifier meet the client's performance criteria?
# Your code here
print(model)
# Get the whole training set
X_train_whole = bigtrain_set.drop("label",axis=1)
y_train_whole = bigtrain_set["label"].copy()
# Train model
model.fit(X_train_whole, y_train_whole)
# Predict on test set
y_test_pred = model.predict(X_test)
# AUC
fpr, tpr, thresholds = roc_curve(y_test, y_test_pred, pos_label=1)
print(f'AUC: {auc(fpr, tpr)}')
# ROC curve
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.plot(fpr,tpr,'r')
plt.title('ROC curve')
plt.show()
# Error binary
plt.plot(np.abs(y_test - y_test_pred),'r.')
plt.xlabel('Sample Number')
plt.ylabel('Error Status')
plt.title('Errors - Binary')
plt.show()
# Calculate and display confusion matrix
cmat = confusion_matrix(y_true=y_test, y_pred=y_test_pred, normalize='true')
sns.heatmap(cmat,annot=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
Pipeline(steps=[('preprocess',
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])),
('sgd', SGDClassifier(alpha=0.1, eta0=0.0001))])
AUC: 0.9666295884315907
Requirements:
Have at least a 95% probability of detecting malignant cancer when it is present (recall or TPR >= 95%)
Have no more than 1 in 10 healthy cases (those with benign tumours) labelled as positive (malignant) (FPR <= 10%)
According to the evaluation metrics on the test set, the model achieves around 97% recall and AUC, with an FPR below 1%. The test AUC is also close to the validation AUC, which suggests the model is neither underfitting nor overfitting.
=> This model meets the client's performance criteria.
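The two criteria can be checked directly by reading recall (TPR) and FPR off the unnormalised confusion matrix. A small sketch with hypothetical labels (not the actual test-set values):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])

# For binary labels {0, 1}, ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # recall: probability of detecting malignant cancer
fpr = fp / (fp + tn)  # fraction of benign cases labelled malignant
print(f'TPR (recall) = {tpr:.2f}, FPR = {fpr:.2f}')

# Client criteria: TPR >= 0.95 and FPR <= 0.10
print(tpr >= 0.95 and fpr <= 0.10)
```

On the real test set the same two numbers would be computed from `y_test` and `y_test_pred`.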
Although it is only possible to know the true usefulness of a feature when you've combined it with others in a machine learning method, it is still helpful to have some measure for how discriminative each feature is on its own. One common method for doing this is to calculate a T-score (often used in statistics, and in the LDA machine learning method) for each feature.
The formula for the T-score is (mean(x2) - mean(x1))/(0.5*(stddev(x2) + stddev(x1))), where x1 and x2 are the datasets corresponding to the two classes. Large values for the T-score (either positive or negative) indicate discriminative ability.
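As a cross-check, the T-score can be computed for every feature at once using pandas column-wise operations. A sketch on a toy DataFrame (hypothetical values, not the assignment data):

```python
import pandas as pd

# Toy stand-in for the real dataset: two features and a binary label
df = pd.DataFrame({
    'feature_a': [1.0, 1.2, 0.9, 3.1, 3.0, 2.8],
    'feature_b': [5.0, 4.8, 5.2, 5.1, 4.9, 5.0],
    'label':     [0, 0, 0, 1, 1, 1],
})

pos = df[df['label'] == 1].drop(columns='label')
neg = df[df['label'] == 0].drop(columns='label')

# T-score for all features at once (column-wise means and stddevs)
t_scores = (pos.mean() - neg.mean()) / (0.5 * (pos.std() + neg.std()))

# Rank by absolute value: large magnitude (either sign) is discriminative
best = t_scores.abs().sort_values(ascending=False)
print(best)
```

Here `feature_a` separates the classes cleanly and gets a large T-score, while `feature_b` has nearly identical class means and scores close to zero.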
Calculate the T-score for each feature and print out the best 4 features according to this score.
# Your code here
# Rows with positive label only
dataset_pos = dataset[dataset['label'] == 1].drop('label', axis=1)
# Rows with negative label only
dataset_neg = dataset[dataset['label'] == 0].drop('label', axis=1)
# Check dataset shape
dataset_pos.shape, dataset_neg.shape
((146, 30), (154, 30))
def calculate_t_score(x1, x2):
return (x1.mean() - x2.mean())/(0.5*(x1.std() + x2.std()))
list_t_score = []
# Calculate t score for all features
for feature in features:
t_score = calculate_t_score(dataset_pos[feature], dataset_neg[feature])
list_t_score.append((feature, t_score))
# Large magnitude (either positive or negative) indicates discriminative ability
list_t_score.sort(key=lambda tup: abs(tup[1]), reverse=True)
print('The best 4 features according to t_score')
for item in list_t_score[:4]:
print(f'{item[0]}: t_score = {item[1]}')
The best 4 features according to t_score
worst concave points: t_score = 2.4871607332953354
worst perimeter: t_score = 2.473073212650544
worst radius: t_score = 2.4108393809617583
mean concave points: t_score = 2.229435693128837
Display the decision boundaries for each pair of features from the best 4 chosen above. You can use the functions below to help (taken from workshop 6).
Instead of using the simple mean as the input for xmean in plot_contours, use the following: 0.5*(mean(x1) + mean(x2)), where x1 and x2 are the data associated with the two classes. This way of calculating a "mean" point takes into account any imbalance between the classes.
def make_meshgrid(x, y, ns=100):
"""Create a mesh of points to plot in
Parameters
----------
x: data to base x-axis meshgrid on (only min and max used)
y: data to base y-axis meshgrid on (only min and max used)
ns: number of steps in grid, optional
Returns
-------
xx, yy : ndarray
"""
x_min, x_max = x.min(), x.max()
y_min, y_max = y.min(), y.max()
hx = (x_max - x_min)/ns
hy = (y_max - y_min)/ns
xx, yy = np.meshgrid(np.arange(x_min - 50*hx, x_max + 50*hx, hx), np.arange(y_min - 50*hy, y_max + 50*hy, hy))
return xx, yy
def plot_contours(clf, xx, yy, xmean, n1, n2, **params):
"""Plot the decision boundaries for a classifier.
Parameters
----------
clf: a classifier
xx: meshgrid ndarray
yy: meshgrid ndarray
xmean : 1d array of mean values (to populate constant features with)
n1, n2: index numbers of features that change (for xx and yy)
params: dictionary of params to pass to contourf, optional
"""
# The following lines make an MxN matrix to pass to the classifier (# samples x # features)
# by multiplying Mx1 and 1xN matrices, where the former is filled with 1's,
# M is the number of grid points in xx, and N is the number of features in xmean.
# The result repeats the xmean vector in each row.
fullx = np.ones((xx.ravel().shape[0],1)) * np.reshape(xmean,(1,-1))
fullx[:,n1] = xx.ravel()
fullx[:,n2] = yy.ravel()
Z = clf.predict(fullx)
Z = Z.reshape(xx.shape)
out = plt.contourf(xx, yy, Z, **params)
return out
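The broadcasting trick in `plot_contours` can be demonstrated in isolation: an Mx1 column of ones multiplied by a 1xN row tiles the row M times, after which the two varying feature columns are overwritten with grid values (numbers here are made up for illustration):

```python
import numpy as np

xmean = np.array([10.0, 20.0, 30.0])  # N = 3 feature "means"
M = 4                                 # pretend we have 4 grid points

# Mx1 of ones times 1xN row -> MxN matrix with xmean repeated in each row
fullx = np.ones((M, 1)) * np.reshape(xmean, (1, -1))
print(fullx.shape)  # (4, 3)

# One chosen feature column is then overwritten with the meshgrid values,
# while the remaining columns stay fixed at their mean values
fullx[:, 0] = np.array([1.0, 2.0, 3.0, 4.0])
print(fullx)
```

This is why the classifier can be evaluated over a 2D grid even though it expects all N features: the non-plotted features are held constant at `xmean`.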
# Your code here
def calculate_x_mean(x1,x2):
return 0.5* (np.mean(x1,axis=0) + np.mean(x2,axis=0))
# Get the best 4 feature names
feature_names = [item[0] for item in list_t_score[:4]]
# Setting figsize
fig=plt.figure(figsize=(20,20), dpi=300)
# Train model sgd with the whole training dataset
sgd = dict_scores['SGD classifier']
best_model = sgd[1]
print(best_model)
best_model.fit(X_train_whole.values, y_train_whole.values)
X_train_local = X_train.copy().values
X_valid_local = X_val.copy().values
# Calculate x mean by using the following: 0.5*(mean(x1) + mean(x2))
X_mean = calculate_x_mean(dataset_pos.values, dataset_neg.values)
# Visualize the decision boundary between 2 features
def decision_boundary(feature_one, feature_two):
n0 = X_train.columns.get_loc(feature_one)
n1 = X_train.columns.get_loc(feature_two)
x10, x90 = np.percentile(X_train_local[:,n0],[10,90])
y10, y90 = np.percentile(X_train_local[:,n1],[10,90])
xx, yy = make_meshgrid(np.array([x10, x90]), np.array([y10, y90]), 500)
# Plot
plot_contours(
best_model,
xx,
yy,
X_mean,
n0,
n1,
cmap=plt.cm.coolwarm,
alpha=0.8
)
plt.scatter(
X_train_local[:,n0],
X_train_local[:,n1],
c=y_train.values,
cmap=plt.cm.coolwarm,
s=20,
edgecolors="k"
)
plt.scatter(
X_valid_local[:,n0],
X_valid_local[:,n1],
c=y_val.values,
cmap=plt.cm.coolwarm,
s=20,
edgecolors="k",
marker='s'
)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xlabel(f"Feature {feature_one}")
plt.ylabel(f"Feature {feature_two}")
plt.title("Decision Boundary")
# Visualize the decision boundary between each pair of features
count = 0
for i in range(4):
    for j in range(4):
        if i > j:
            # Add the next subplot to the figure
            ax = fig.add_subplot(3, 2, count + 1)
            # Show the decision boundary for this pair of features
            decision_boundary(feature_names[i], feature_names[j])
            # Axis labels for this pair
            plt.xlabel(feature_names[i])
            plt.ylabel(feature_names[j])
            count += 1
plt.show()
Pipeline(steps=[('preprocess',
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])),
('sgd', SGDClassifier(alpha=0.1, eta0=0.0001))])
From the decision boundaries displayed above, would you expect the method to extrapolate well or not? Give reasons for your answer.
A decision boundary is the surface in feature space that separates the regions a classifier assigns to different class labels. If the decision surface is a hyperplane, the problem is linear and the classes are linearly separable; in general, decision boundaries need not be so clear-cut.
In the plots above, the decision boundary separates the blue and red predicted regions, while the scattered points show the ground-truth labels. The boundary of the chosen SGD classifier is a simple hyperplane that follows the trend of the data with few misclassified points, so it should extrapolate reasonably well beyond the plotted region.
After presenting your initial results to the client they come back to you and say that they have done some financial analysis and it would save them a lot of time and money if they did not have to analyse every cell, which is needed to get the "worst" features. Instead, they can quickly get accurate estimates for the "mean" and "standard error" features from a much smaller, randomly selected set of cells.
They ask you to give them a performance estimate for the same problem, but without using any of the "worst" features.
Calculate an unbiased performance estimate for this new problem, as requested by the client.
# Your code here
dataset_without_worst = dataset.copy()
dataset_without_worst = dataset_without_worst[
dataset_without_worst.columns.drop(list(dataset_without_worst.filter(regex='worst')))
]
dataset_without_worst.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   label                    300 non-null    int64
 1   mean radius              300 non-null    float64
 2   mean texture             300 non-null    float64
 3   mean perimeter           300 non-null    float64
 4   mean area                300 non-null    float64
 5   mean smoothness          300 non-null    float64
 6   mean compactness         300 non-null    float64
 7   mean concavity           300 non-null    float64
 8   mean concave points      300 non-null    float64
 9   mean symmetry            300 non-null    float64
 10  mean fractal dimension   300 non-null    float64
 11  radius error             300 non-null    float64
 12  texture error            300 non-null    float64
 13  perimeter error          300 non-null    float64
 14  area error               300 non-null    float64
 15  smoothness error         300 non-null    float64
 16  compactness error        300 non-null    float64
 17  concavity error          300 non-null    float64
 18  concave points error     300 non-null    float64
 19  symmetry error           300 non-null    float64
 20  fractal dimension error  300 non-null    float64
dtypes: float64(20), int64(1)
memory usage: 49.3 KB
# Split dataset
bigtrain_set_without_worst, test_set_without_worst = train_test_split(dataset_without_worst, test_size=0.2, random_state=42, stratify=dataset_without_worst['label'])
train_set_without_worst, val_set_without_worst = train_test_split(bigtrain_set_without_worst, test_size=0.2, random_state=42, stratify=bigtrain_set_without_worst['label'])
# Verify the size and label distribution in each dataset.
X_train_without_worst = train_set_without_worst.drop("label",axis=1)
y_train_without_worst = train_set_without_worst["label"].copy()
X_test_without_worst = test_set_without_worst.drop("label",axis=1)
y_test_without_worst = test_set_without_worst["label"].copy()
X_val_without_worst = val_set_without_worst.drop("label",axis=1)
y_val_without_worst = val_set_without_worst["label"].copy()
print(f'Shape of the whole dataset: {dataset_without_worst.shape}')
print()
print("Verify the size in each dataset")
print(f'\t1. Shape of the training dataset: {train_set_without_worst.shape}')
print(f'\t2. Shape of the validation dataset: {val_set_without_worst.shape}')
print(f'\t3. Shape of the test dataset: {test_set_without_worst.shape}')
print()
print("Verify label distribution in each dataset")
print(f'\t1. Mean of label in training dataset: {np.mean(y_train_without_worst)}')
print(f'\t2. Mean of label in validation dataset: {np.mean(y_val_without_worst)}')
print(f'\t3. Mean of label in test dataset: {np.mean(y_test_without_worst)}')
Shape of the whole dataset: (300, 21)

Verify the size in each dataset
	1. Shape of the training dataset: (192, 21)
	2. Shape of the validation dataset: (48, 21)
	3. Shape of the test dataset: (60, 21)

Verify label distribution in each dataset
	1. Mean of label in training dataset: 0.4895833333333333
	2. Mean of label in validation dataset: 0.4791666666666667
	3. Mean of label in test dataset: 0.48333333333333334
sgd_pipeline = Pipeline([
('preprocess', preprocessing_pipeline),
('sgd', SGDClassifier())
])
# Train
sgd_pipeline.fit(X_train_without_worst,y_train_without_worst)
# Predict
y_val_pred = sgd_pipeline.predict(X_val_without_worst)
# Display Metric AUC
fpr, tpr, thresholds = roc_curve(y_val_without_worst, y_val_pred, pos_label=1)
print(f'AUC: {auc(fpr, tpr)}')
# ROC curve
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.plot(fpr,tpr,'r')
plt.title('ROC curve')
plt.show()
# Error binary
plt.plot(np.abs(y_val_without_worst - y_val_pred),'r.')
plt.xlabel('Sample Number')
plt.ylabel('Error Status')
plt.title('Errors - Binary')
plt.show()
# Confusion matrix
cmat = confusion_matrix(y_true=y_val_without_worst, y_pred=y_val_pred, normalize='true')
sns.heatmap(cmat, annot=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Normalised Confusion Matrix')
plt.show()
AUC: 0.8965217391304349
for key, value in dict_models.items():
pl = value[0]
params = value[1]
report_model(pl, params, key, y_val_without_worst)
KNN (K-Nearest Neighbour) classifier
Best params for KNN (K-Nearest Neighbour) classifier : {'knn__n_neighbors': 20, 'knn__weights': 'distance'}
AUC: 0.9365217391304348
Decision tree classifier
Best params for Decision tree classifier : {'dt__max_depth': 4, 'dt__min_samples_leaf': 7}
AUC: 0.8930434782608696
Support vector machine classifier
Best params for Support vector machine classifier : {'svc__kernel': 'rbf'}
AUC: 0.9365217391304348
SGD classifier
Best params for SGD classifier : {'sgd__alpha': 0.0001, 'sgd__eta0': 0.1, 'sgd__learning_rate': 'optimal', 'sgd__penalty': 'l2'}
AUC: 0.9565217391304348
the_best_model_name = None
the_best_auc = None
for key, value in dict_scores.items():
if the_best_model_name is None:
the_best_model_name = key
the_best_auc = value[0]
elif the_best_auc < value[0]:
the_best_model_name = key
the_best_auc = value[0]
print(f'{key}')
print(f'\tAUC = {value[0]}')
KNN (K-Nearest Neighbour) classifier
	AUC = 0.9365217391304348
Decision tree classifier
	AUC = 0.8930434782608696
Support vector machine classifier
	AUC = 0.9365217391304348
SGD classifier
	AUC = 0.9565217391304348
print(f"The best model is {the_best_model_name} with AUC = {the_best_auc}")
The best model is SGD classifier with AUC = 0.9565217391304348
model_score = dict_scores[the_best_model_name]
print(f'hyperparameter settings : {model_score[2]}')
model = model_score[1]
print(model)
# Get the whole training set
X_train_whole_without_worst = bigtrain_set_without_worst.drop("label",axis=1)
y_train_whole_without_worst = bigtrain_set_without_worst["label"].copy()
# Train model
model.fit(X_train_whole_without_worst, y_train_whole_without_worst)
# Predict on test set
y_test_pred = model.predict(X_test_without_worst)
# AUC
fpr, tpr, thresholds = roc_curve(y_test_without_worst, y_test_pred, pos_label=1)
print(f'AUC: {auc(fpr, tpr)}')
# ROC curve
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.plot(fpr,tpr,'r')
plt.title('ROC curve')
plt.show()
# Error binary
plt.plot(np.abs(y_test_without_worst - y_test_pred),'r.')
plt.xlabel('Sample Number')
plt.ylabel('Error Status')
plt.title('Errors - Binary')
plt.show()
# Calculate and display confusion matrix
cmat = confusion_matrix(y_true=y_test_without_worst, y_pred=y_test_pred, normalize='true')
sns.heatmap(cmat,annot=True)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
hyperparameter settings : {'sgd__alpha': 0.0001, 'sgd__eta0': 0.1, 'sgd__learning_rate': 'optimal', 'sgd__penalty': 'l2'}
Pipeline(steps=[('preprocess',
Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())])),
('sgd', SGDClassifier(eta0=0.1))])
AUC: 0.899888765294772
Do you think the new classifier, which does not use the "worst" features, is:
Give reasons for your answer.